Introduction

We were given a data set from the National Health and Nutrition Examination Survey (NHANES). According to CDC (2023), NHANES is a program of surveys, conducted since the 1960s, designed to assess the health and nutritional status of adults and children in the United States. It collects data through in-person interviews and physical examinations of participants.

The survey data were not a simple random sample, however. According to the CDC’s National Health and Nutrition Examination Survey: Plan and Operations, 1999–2010 (Zipf et al. 2013), the sampling strategy consists of four stages:
1. Selection of counties as primary sampling units (PSUs).
2. Selection of segments within PSUs that constitute blocks of households.
3. Selection of specific households within segments.
4. Selection of individuals within a household.
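
Because the design is not a simple random sample, population-level estimates generally need the sampling weights (the samplewt variable in our data). The following is only a minimal sketch of a weighted mean, using made-up toy values rather than NHANES records; the function name `weighted_mean` is hypothetical, and a full design-based analysis would also use the psu and strata variables for variance estimation, which this sketch does not attempt.

```python
def weighted_mean(values, weights):
    """Mean of `values` where each observation counts `weights[i]` times."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical toy sample (NOT real NHANES data): body weight in kg and a
# made-up sampling weight for each participant.
wt = [60.0, 80.0, 100.0]
samplewt = [30000.0, 10000.0, 10000.0]

unweighted = sum(wt) / len(wt)
weighted = weighted_mean(wt, samplewt)
print(f"unweighted mean: {unweighted:.1f} kg, weighted mean: {weighted:.1f} kg")
```

In this toy example the first participant represents more people than the other two, so the weighted mean is pulled toward 60 kg.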

We aim to study the relationship between the weight variable and the other health-related variables in the data.

Method

We began our study with an exploratory analysis of the variables using tables and charts. We then performed several hypothesis tests on some of the variables. Lastly, we fit a linear regression model with “weight” as the response variable and the other variables, including confounders, as predictors.

Part 1: Exploratory Analysis

We began our analysis by compiling a data dictionary, shown in the Data Variable Definition table below. Some variables have a relatively high percentage of missing values (for example, 9.7% for marital status). In Part 2 we ran hypothesis tests to decide whether some of these variables could be excluded from the regression analysis in Part 3.

Weight is a continuous variable in our data. A simple way to categorize it is through BMI. The data already contain an obese variable, which dichotomizes weight status at a BMI threshold of 35: a person is coded as obese if their BMI exceeds 35 and as not obese otherwise. We therefore used the obese variable as the categorical version of weight in our project.

Table 1 shows that our analysis data set has 6445 observations and 21 variables, 8 of which can be treated as categorical. The original data set has 6482 observations, 37 of which are missing the bmi variable. Because these account for less than 0.6% of the data (37/6482 ≈ 0.57%), we deleted them and stratified the remaining observations by BMI level.
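
The complete-case step described above can be sketched as follows. This is an illustration only: the helper `drop_missing` and the three toy records are hypothetical, and only the counts 6482 and 37 come from the report.

```python
def drop_missing(records, key):
    """Return the records whose `key` is present, plus the number dropped."""
    kept = [r for r in records if r.get(key) is not None]
    return kept, len(records) - len(kept)


# Hypothetical toy records (NOT real NHANES rows); None marks a missing BMI.
toy = [
    {"id": 1, "bmi": 32.22},
    {"id": 2, "bmi": None},
    {"id": 3, "bmi": 22.0},
]
kept, n_dropped = drop_missing(toy, "bmi")
print(f"kept {len(kept)} of {len(toy)} toy records, dropped {n_dropped}")

# Arithmetic check of the counts quoted in the text.
n_original, n_missing_bmi = 6482, 37
remaining = n_original - n_missing_bmi
missing_pct = 100 * n_missing_bmi / n_original
print(f"{remaining} observations remain; {missing_pct:.2f}% were dropped")
```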

Table 1: Characteristics of the data set NHANES

| Characteristic | Healthy weight (N=1883) | Obesity (N=2311) | Overweight (N=2127) | Underweight (N=124) | Overall (N=6445) |
|---|---|---|---|---|---|
| Gender | | | | | |
| Male | 897 (47.6%) | 1036 (44.8%) | 1171 (55.1%) | 40 (32.3%) | 3144 (48.8%) |
| Female | 986 (52.4%) | 1275 (55.2%) | 956 (44.9%) | 84 (67.7%) | 3301 (51.2%) |
| Age (years) | | | | | |
| Mean (SD) | 41.2 (20.6) | 48.7 (17.7) | 48.9 (19.0) | 37.9 (21.0) | 46.4 (19.4) |
| Median [Min, Max] | 37.0 [16.0, 80.0] | 49.0 [16.0, 80.0] | 49.0 [16.0, 80.0] | 30.0 [16.0, 80.0] | 46.0 [16.0, 80.0] |
| Marital Status | | | | | |
| Married | 741 (39.4%) | 1158 (50.1%) | 1074 (50.5%) | 31 (25.0%) | 3004 (46.6%) |
| Widowed | 121 (6.4%) | 185 (8.0%) | 190 (8.9%) | 8 (6.5%) | 504 (7.8%) |
| Divorced | 154 (8.2%) | 262 (11.3%) | 210 (9.9%) | 14 (11.3%) | 640 (9.9%) |
| Separated | 47 (2.5%) | 82 (3.5%) | 63 (3.0%) | 1 (0.8%) | 193 (3.0%) |
| Never Married | 351 (18.6%) | 353 (15.3%) | 289 (13.6%) | 30 (24.2%) | 1023 (15.9%) |
| Living Together | 141 (7.5%) | 148 (6.4%) | 157 (7.4%) | 8 (6.5%) | 454 (7.0%) |
| Missing | 328 (17.4%) | 123 (5.3%) | 144 (6.8%) | 32 (25.8%) | 627 (9.7%) |
| Statistical Weight | | | | | |
| Mean (SD) | 36700 (26000) | 33000 (25100) | 34200 (26300) | 37400 (27800) | 34600 (25800) |
| Median [Min, Max] | 26100 [5050, 154000] | 23200 [4080, 124000] | 23600 [4450, 141000] | 26500 [6840, 113000] | 24200 [4080, 154000] |
| Pseudo-PSU | | | | | |
| Mean (SD) | 1.51 (0.500) | 1.50 (0.500) | 1.51 (0.500) | 1.50 (0.502) | 1.51 (0.500) |
| Median [Min, Max] | 2.00 [1.00, 2.00] | 2.00 [1.00, 2.00] | 2.00 [1.00, 2.00] | 1.50 [1.00, 2.00] | 2.00 [1.00, 2.00] |
| Pseudo-stratum | | | | | |
| Mean (SD) | 7.11 (4.09) | 7.36 (4.13) | 7.15 (4.16) | 7.80 (4.14) | 7.22 (4.13) |
| Median [Min, Max] | 7.00 [1.00, 15.0] | 7.00 [1.00, 15.0] | 7.00 [1.00, 15.0] | 8.00 [1.00, 15.0] | 7.00 [1.00, 15.0] |
| Total Cholesterol (mg/dL) | | | | | |
| Mean (SD) | 185 (39.9) | 194 (40.5) | 198 (42.8) | 172 (33.4) | 192 (41.4) |
| Median [Min, Max] | 180 [92.0, 383] | 191 [92.0, 357] | 194 [90.0, 380] | 166 [108, 289] | 189 [90.0, 383] |
| Missing | 123 (6.5%) | 142 (6.1%) | 121 (5.7%) | 6 (4.8%) | 392 (6.1%) |
| HDL-Cholesterol (mg/dL) | | | | | |
| Mean (SD) | 58.4 (17.1) | 47.6 (13.7) | 51.8 (15.5) | 63.3 (17.1) | 52.5 (16.0) |
| Median [Min, Max] | 56.0 [11.0, 144] | 46.0 [15.0, 115] | 50.0 [16.0, 119] | 63.0 [26.0, 114] | 50.0 [11.0, 144] |
| Missing | 124 (6.6%) | 142 (6.1%) | 120 (5.6%) | 6 (4.8%) | 392 (6.1%) |
| Systolic Blood Pressure (mm Hg) | | | | | |
| Mean (SD) | 119 (18.5) | 125 (17.3) | 125 (18.5) | 111 (18.5) | 123 (18.3) |
| Median [Min, Max] | 116 [90.0, 220] | 124 [90.0, 200] | 122 [90.0, 208] | 106 [90.0, 220] | 120 [90.0, 220] |
| Missing | 164 (8.7%) | 206 (8.9%) | 154 (7.2%) | 20 (16.1%) | 544 (8.4%) |
| Diastolic Blood Pressure (mm Hg) | | | | | |
| Mean (SD) | 67.4 (11.2) | 71.3 (12.4) | 69.8 (11.8) | 65.7 (11.3) | 69.6 (11.9) |
| Median [Min, Max] | 68.0 [40.0, 118] | 72.0 [40.0, 134] | 70.0 [40.0, 118] | 66.0 [44.0, 110] | 70.0 [40.0, 134] |
| Missing | 167 (8.9%) | 230 (10.0%) | 170 (8.0%) | 18 (14.5%) | 585 (9.1%) |
| Weight (kg) | | | | | |
| Mean (SD) | 63.1 (9.13) | 99.0 (17.7) | 77.3 (10.3) | 47.9 (5.56) | 80.4 (20.2) |
| Median [Min, Max] | 62.8 [38.5, 95.5] | 96.9 [57.8, 159] | 76.8 [45.5, 117] | 47.7 [33.2, 63.0] | 77.6 [33.2, 159] |
| Standing Height (cm) | | | | | |
| Mean (SD) | 168 (10.0) | 167 (10.4) | 168 (10.4) | 166 (7.59) | 167 (10.2) |
| Median [Min, Max] | 167 [140, 203] | 166 [135, 196] | 168 [123, 202] | 165 [147, 186] | 167 [123, 203] |
| Vigorous Work Activity | | | | | |
| Yes | 324 (17.2%) | 418 (18.1%) | 371 (17.4%) | 16 (12.9%) | 1129 (17.5%) |
| No | 1558 (82.7%) | 1893 (81.9%) | 1756 (82.6%) | 108 (87.1%) | 5315 (82.5%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Moderate Work Activity | | | | | |
| Yes | 651 (34.6%) | 796 (34.4%) | 701 (33.0%) | 32 (25.8%) | 2180 (33.8%) |
| No | 1231 (65.4%) | 1515 (65.6%) | 1426 (67.0%) | 92 (74.2%) | 4264 (66.2%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Walk or Bicycle | | | | | |
| Yes | 630 (33.5%) | 549 (23.8%) | 573 (26.9%) | 48 (38.7%) | 1800 (27.9%) |
| No | 1252 (66.5%) | 1762 (76.2%) | 1554 (73.1%) | 76 (61.3%) | 4644 (72.1%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Vigorous Recreational Activities | | | | | |
| Yes | 579 (30.7%) | 344 (14.9%) | 449 (21.1%) | 27 (21.8%) | 1399 (21.7%) |
| No | 1303 (69.2%) | 1967 (85.1%) | 1678 (78.9%) | 97 (78.2%) | 5045 (78.3%) |
| Missing | 1 (0.1%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0.0%) |
| Moderate Recreational Activities | | | | | |
| Yes | 834 (44.3%) | 791 (34.2%) | 823 (38.7%) | 37 (29.8%) | 2485 (38.6%) |
| No | 1048 (55.7%) | 1520 (65.8%) | 1303 (61.3%) | 87 (70.2%) | 3958 (61.4%) |
| Missing | 1 (0.1%) | 0 (0%) | 1 (0.0%) | 0 (0%) | 2 (0.0%) |
| Minutes of Sedentary Activity per Week (mins) | | | | | |
| Mean (SD) | 316 (185) | 333 (186) | 308 (184) | 366 (195) | 321 (186) |
| Median [Min, Max] | 300 [0, 840] | 300 [0, 840] | 300 [1.00, 840] | 300 [10.0, 840] | 300 [0, 840] |
| Missing | 17 (0.9%) | 34 (1.5%) | 26 (1.2%) | 1 (0.8%) | 78 (1.2%) |
| Obese | | | | | |
| No | 1883 (100%) | 1325 (57.3%) | 2127 (100%) | 124 (100%) | 5459 (84.7%) |
| Yes | 0 (0%) | 986 (42.7%) | 0 (0%) | 0 (0%) | 986 (15.3%) |

Data Variable Definition

| Variable | Type | Example values | Unique values | Missing (%) | Description |
|---|---|---|---|---|---|
| id | integer | 1, 2, 3 | 6482 | 0% | Identification Code (1 - 6482) |
| gender | factor | Male, Female | 2 | 0% | Gender (1: Male, 2: Female) |
| age | integer | 34, 16, 60 | 65 | 0% | Age (Years) |
| marstat | factor | Married, NA, Widowed | 6 | 9.7% | Marital Status (1: Married, 2: Widowed, 3: Divorced, 4: Separated, 5: Never Married, 6: Living Together) |
| samplewt | numeric | 80100.544, 13953.078, 20090.339 | 2499 | 0% | Statistical Weight (4084.478 - 153810.3) |
| psu | integer | 1, 2 | 2 | 0% | Pseudo-PSU (1, 2) |
| strata | integer | 9, 10, 1 | 15 | 0% | Pseudo-Stratum (1 - 15) |
| tchol | integer | 135, 192, 202 | 251 | 6.09% | Total Cholesterol (mg/dL) |
| hdl | integer | 50, 60, 45 | 112 | 6.09% | HDL-Cholesterol (mg/dL) |
| sysbp | integer | 114, 112, 154 | 61 | 8.53% | Systolic Blood Pressure (mm Hg) |
| dbp | integer | 88, 62, 70 | 40 | 9.16% | Diastolic Blood Pressure (mm Hg) |
| wt | numeric | 87.400002, 72.300003, 116.8 | 957 | 0.57% | Weight (kg) |
| ht | numeric | 164.7, 181.3, 166 | 527 | 0.57% | Standing Height (cm) |
| bmi | numeric | 32.22, 22, 42.39 | 2276 | 0.57% | Body Mass Index (kg/m^2) |
| vigwrk | factor | No, Yes, NA | 2 | 0.02% | Vigorous Work Activity (1: Yes, 2: No) |
| modwrk | factor | No, Yes, NA | 2 | 0.02% | Moderate Work Activity (1: Yes, 2: No) |
| wlkbik | factor | No, Yes, NA | 2 | 0.02% | Walk or Bicycle (1: Yes, 2: No) |
| vigrecexr | factor | No, Yes, NA | 2 | 0.02% | Vigorous Recreational Activities (1: Yes, 2: No) |
| modrecexr | factor | No, Yes, NA | 2 | 0.03% | Moderate Recreational Activities (1: Yes, 2: No) |
| sedmin | integer | 480, 240, 720 | 37 | 1.22% | Minutes of Sedentary Activity per Week (0 - 840) |
| obese | factor | No, Yes, NA | 2 | 0.57% | BMI>35 (1: No, 2: Yes) |

The CDC classifies body weight by BMI as follows: below 18.5 is Underweight, 18.5 to 24.9 is Healthy weight, 25 to 29.9 is Overweight, and 30 or above is Obesity. We adopted these categories and found a slight positive relationship between weight category and total cholesterol. HDL, however, shows a negative relationship with weight category: mean HDL falls from 63.3 mg/dL in the Underweight group to 47.6 mg/dL in the Obesity group. Because total cholesterol is made up largely of HDL and LDL, this pattern suggests that the obese population tends to have a higher level of LDL and a lower level of HDL.
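
The BMI categories above, together with the BMI>35 rule behind the obese variable in the data dictionary, can be sketched as a small classifier. The function names `bmi_category` and `is_obese` are hypothetical, and treating the category boundaries as half-open intervals (for example, 24.95 counts as Healthy weight) is an assumption of this sketch.

```python
def bmi_category(bmi):
    """Map a BMI value (kg/m^2) to the four weight categories used in Table 1."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Healthy weight"
    if bmi < 30:
        return "Overweight"
    return "Obesity"


def is_obese(bmi):
    """Binary flag following the data dictionary's rule for `obese`: BMI > 35."""
    return bmi > 35


# The three example BMI values listed in the data dictionary.
for bmi in (32.22, 22, 42.39):
    print(bmi, bmi_category(bmi), is_obese(bmi))
```

The first example value shows why the Obesity column of Table 1 contains both Yes and No values of the obese variable: a BMI of 32.22 falls in the Obesity category but is not above 35.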

According to ATPIII (n.d.), we can also categorize the cholesterol level.

The data set covers participants between 16 and 80 years old. At all ages the average weight of males is greater than that of females, and the line chart shows that average weight follows the same trend with age for both genders: a sustained increase, followed by fluctuation, and finally a steady decrease. This suggests that weight may be related to age.
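
The group averages behind such a line chart can be sketched as below. This is an illustration with made-up toy records, not the code or data used for the chart; the helper `mean_weight_by_group` and the 10-year age bands are assumptions of the sketch.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records (gender, age in years, weight in kg); NOT NHANES data.
records = [
    ("Male", 22, 78.0), ("Male", 27, 82.0), ("Male", 45, 90.0),
    ("Female", 24, 64.0), ("Female", 41, 72.0), ("Female", 48, 76.0),
]


def mean_weight_by_group(rows, band_width=10):
    """Average weight for each (gender, start of age band) combination."""
    groups = defaultdict(list)
    for gender, age, weight in rows:
        band_start = (age // band_width) * band_width
        groups[(gender, band_start)].append(weight)
    return {key: mean(vals) for key, vals in sorted(groups.items())}


summary = mean_weight_by_group(records)
for (gender, band_start), avg in summary.items():
    print(f"{gender}, age band starting at {band_start}: {avg:.1f} kg")
```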


We were also interested in the relationship between weight and marital status. The following box plot shows that the median weight is around 80 kg in every marital status category, with widowed participants having the lowest weight of the six categories. The Married and Never Married categories contain more people heavier than 130 kg than the other categories, which may indicate a health risk.

The different types of work and recreational activity are also worth examining. Very few values are missing in these four variables (one or two observations each, see Table 1), so we simply omitted them.

For vigorous activities, the group that does neither vigorous work nor vigorous recreational activity contains both the lowest and the highest BMI values. Among participants who do vigorous recreational activities, regardless of their work activity, the upper quartile of BMI is below 30 and the lower quartile is above 20, so at least half of this group has a BMI between 20 and 30, below the obesity threshold.

For moderate activities, the group that does neither moderate work nor moderate recreational activity has the highest BMI, and the group that does only moderate work activity has the lowest. Among participants who do moderate recreational activities, regardless of their work activity, the upper quartile is slightly above 30 and the lower quartile is above 20.

In general, the two plots show that participants who do vigorous or moderate recreational activities tend to have healthier BMI values than the others and that, for a given recreational activity status, fewer participants with moderate or vigorous work activity have a high BMI. Work and recreational activity may therefore be related to weight.

Part 2: Hypothesis Tests

We first test the independence between obesity and marital status. We form the following contingency table:
Contingency Table

| Marital Status | Obesity: No | Obesity: Yes |
|---|---|---|
| Married | 2530 | 474 |
| Widowed | 418 | 86 |
| Divorced | 528 | 112 |
| Separated | 158 | 35 |
| Never Married | 863 | 160 |
| Living Together | 388 | 66 |

Let X be the categorical random variable for Marital Status and Y the one for Obesity. Assume a random sample of n trials and define the count random variable \(N_{ij}:=\sum_{k=1}^n \mathbf{I}_k(X=i, Y=j)\), where \(\mathbf{I}_k\) is the indicator function for the k-th trial. Then the random vector \([N_{11}, ..., N_{IJ}]\) has a multinomial distribution with n trials and cell probabilities \(\vec{p}=[p_{11}, ..., p_{IJ}]\). Writing \(p_{i+}\) and \(p_{+j}\) for the marginal probabilities of X and Y, our hypothesis test is therefore:

\[\begin{gather*} H_0: p_{ij}= p_{i+} \cdot p_{+j} ~ \forall i,j\\ H_1:p_{ij} \neq p_{i+} \cdot p_{+j} ~ \text{for some } i,j \end{gather*}\]
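
A standard way to test these hypotheses is Pearson's chi-squared statistic. As a sketch, with \(N_{i+}\) and \(N_{+j}\) denoting the observed row and column totals and \(\hat{E}_{ij}\) the estimated expected counts (notation introduced here for illustration), and assuming all expected counts are reasonably large:

\[\begin{gather*} \hat{E}_{ij} = \frac{N_{i+} \cdot N_{+j}}{n}, \qquad X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(N_{ij}-\hat{E}_{ij})^2}{\hat{E}_{ij}} \end{gather*}\]

Under \(H_0\), \(X^2\) is approximately chi-squared distributed with \((I-1)(J-1)\) degrees of freedom, which is 5 for the 6-by-2 table above.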

The chi-squared test gives a p-value of 0.6894, so there is not enough evidence to reject the null hypothesis. In other words, we cannot conclude that there is a relationship between obesity and marital status.
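
As a cross-check, the test can be reproduced from the contingency table with a few lines of standard-library Python. This is a sketch rather than the code used for the report: `chi2_sf` and `chi2_independence` are hypothetical helper names, and the tail probability uses the closed-form recursion for integer degrees of freedom instead of a statistics package.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-squared variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)             # Q(1, h) = exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))  # Q(1/2, h) = erfc(sqrt(h))
    # Recursion: Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q


def chi2_independence(table):
    """Pearson chi-squared test of independence for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed_count - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df)


# Rows: Married, Widowed, Divorced, Separated, Never Married, Living Together.
# Columns: obese No, obese Yes (counts from the contingency table above).
observed = [[2530, 474], [418, 86], [528, 112], [158, 35], [863, 160], [388, 66]]
stat, df, p_value = chi2_independence(observed)
print(f"X^2 = {stat:.3f}, df = {df}, p-value = {p_value:.4f}")
```

The p-value from this sketch should agree closely with the 0.6894 reported above. Note that it treats the observations as a simple random sample and ignores the sampling weights.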

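We ran the same test between obesity and each of the activity variables. Table 2 shows that we can reject independence between obesity and each of wlkbik, vigrecexr and modrecexr, whose p-values are far below any conventional significance level, but not between obesity and vigwrk or modwrk.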

Table 2: p-values of independence tests between different variables and obesity

| | vigwrk | modwrk | wlkbik | vigrecexr | modrecexr |
|---|---|---|---|---|---|
| p-value | 0.5695 | 0.3037 | 1.064e-07 | 4.061e-15 | 2.573e-09 |

Conclusion

References

CDC. 2023. https://www.cdc.gov/nchs/nhanes/about_nhanes.htm.
ATPIII. n.d.
Zipf G, Chiappa M, Porter KS, et al. 2013. “National Health and Nutrition Examination Survey: Plan and Operations, 1999–2010.” National Center for Health Statistics 1 (56).